import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport
import phik
import plotly.express as px
plt.style.use('ggplot')
ls
Volume in drive C has no label.
Volume Serial Number is 40F4-F4CD
Directory of C:\Users\DELL\Documents\Binus\Semester 4\ML\UAS ML
09/07/2022 20:55 <DIR> .
09/07/2022 20:55 <DIR> ..
06/07/2022 20:57 <DIR> .ipynb_checkpoints
06/07/2022 19:55 52.532 CustomerMarketingDataset.csv
09/07/2022 20:55 6.047.166 UAS ML.ipynb
06/07/2022 16:25 201.683 UAS ML.pdf
3 File(s) 6.301.381 bytes
3 Dir(s) 218.792.570.880 bytes free
%matplotlib inline
df = pd.read_csv("CustomerMarketingDataset.csv")
df
|   | Age | Gender | OwnHome | Married | Location | Salary | Children | History | Catalogs | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Young | Male | Rent | Single | Close | 15000 | 3 | Low | 6 | Low |
| 1 | Young | Male | Rent | Single | Close | 13000 | 3 | Low | 6 | Low |
| 2 | Young | Female | Rent | Single | Close | 14600 | 3 | Low | 6 | Low |
| 3 | Young | Female | Rent | Single | Close | 17900 | 3 | Low | 6 | Low |
| 4 | Old | Female | Own | Single | Close | 12700 | 2 | Low | 6 | Low |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Middle | Female | Own | Married | Far | 99200 | 0 | High | 24 | Very High |
| 996 | Old | Female | Own | Married | Far | 110000 | 0 | High | 24 | Very High |
| 997 | Middle | Female | Rent | Married | Far | 120800 | 1 | High | 24 | Very High |
| 998 | Middle | Male | Own | Married | Far | 123000 | 1 | High | 24 | Very High |
| 999 | Old | Male | Own | Married | Far | 112900 | 0 | High | 24 | Very High |
1000 rows × 10 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Age          1000 non-null   object
 1   Gender       1000 non-null   object
 2   OwnHome      1000 non-null   object
 3   Married      1000 non-null   object
 4   Location     1000 non-null   object
 5   Salary       1000 non-null   int64
 6   Children     1000 non-null   int64
 7   History      697 non-null    object
 8   Catalogs     1000 non-null   int64
 9   AmountSpent  1000 non-null   object
dtypes: int64(3), object(7)
memory usage: 78.2+ KB
df.duplicated().any()
True
df[df.duplicated()]
|   | Age | Gender | OwnHome | Married | Location | Salary | Children | History | Catalogs | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|
| 117 | Young | Male | Rent | Single | Close | 17600 | 0 | NaN | 6 | Low |
| 145 | Young | Female | Rent | Single | Far | 18400 | 1 | Low | 6 | Low |
The check above flags two rows (indices 117 and 145) as exact duplicates, which matches the `True` returned by `df.duplicated().any()`. Since the dataset has no customer ID column, these may simply be different customers who happen to share identical attributes rather than data-entry errors.
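When exact duplicates do need removing, `DataFrame.drop_duplicates` keeps the first occurrence of each repeated row. A minimal sketch on hypothetical data (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature frame with one exact duplicate row.
df_demo = pd.DataFrame({
    'Age': ['Young', 'Young', 'Old'],
    'Salary': [17600, 17600, 54600],
})

print(df_demo.duplicated().any())     # True: the second row repeats the first
df_dedup = df_demo.drop_duplicates()  # keeps the first occurrence of each row
print(len(df_dedup))                  # 2
```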
df[df['History'].isna()]
|   | Age | Gender | OwnHome | Married | Location | Salary | Children | History | Catalogs | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | Young | Male | Rent | Single | Close | 11200 | 0 | NaN | 6 | Low |
| 23 | Young | Female | Rent | Single | Close | 14100 | 3 | NaN | 6 | Low |
| 24 | Young | Female | Rent | Single | Close | 11500 | 3 | NaN | 6 | Low |
| 30 | Young | Female | Rent | Single | Close | 13700 | 1 | NaN | 6 | Low |
| 39 | Young | Male | Rent | Single | Close | 19100 | 0 | NaN | 6 | Low |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 974 | Middle | Male | Own | Married | Close | 105300 | 1 | NaN | 18 | Very High |
| 975 | Middle | Male | Own | Married | Close | 123800 | 2 | NaN | 18 | Very High |
| 984 | Middle | Male | Own | Married | Close | 96300 | 2 | NaN | 18 | Very High |
| 985 | Middle | Male | Own | Married | Far | 118400 | 2 | NaN | 12 | Very High |
| 992 | Middle | Male | Own | Married | Close | 122700 | 1 | NaN | 18 | Very High |
303 rows × 10 columns
There are many rows with missing values in the History column. Since this is customer data, the missing values may not be accidental. Sometimes a value is NaN deliberately: for example, if a customer had never made a purchase at Bee Department Store, NaN could indicate that they have no purchase history. However, the AmountSpent column records how much each customer has spent at Bee Department Store, which means every customer here has made a purchase.
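The extent of the missingness can be quantified with `isna`; a small sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where 'History' is partially missing, like the real dataset.
df_demo = pd.DataFrame({
    'History': ['Low', np.nan, 'High', np.nan],
    'AmountSpent': ['Low', 'Low', 'High', 'Medium'],
})

missing = df_demo.isna().sum()                 # missing count per column
print(missing['History'])                      # 2
print(df_demo['History'].isna().mean() * 100)  # 50.0 (percent missing)
```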
cat = []
num = []
for col in df.columns:
if df[col].dtype == 'O':
cat.append(col)
else:
num.append(col)
df[cat].describe()
|   | Age | Gender | OwnHome | Married | Location | History | AmountSpent |
|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 697 | 1000 |
| unique | 3 | 2 | 2 | 2 | 2 | 3 | 4 |
| top | Middle | Female | Own | Married | Close | High | Medium |
| freq | 508 | 506 | 516 | 502 | 710 | 255 | 251 |
df[num].describe()
|   | Salary | Children | Catalogs |
|---|---|---|---|
| count | 1000.000000 | 1000.00000 | 1000.000000 |
| mean | 56103.900000 | 0.93400 | 14.682000 |
| std | 30616.314826 | 1.05107 | 6.622895 |
| min | 10100.000000 | 0.00000 | 6.000000 |
| 25% | 29975.000000 | 0.00000 | 6.000000 |
| 50% | 53700.000000 | 1.00000 | 12.000000 |
| 75% | 77025.000000 | 2.00000 | 18.000000 |
| max | 168800.000000 | 3.00000 | 24.000000 |
plt.figure(figsize = (16,4))
for i in range(len(num)):
plt.subplot(1, round(len(num)/1), i+1)
sns.boxplot(y = df[num[i]], orient = 'v')
plt.tight_layout()
plt.figure(figsize = (16,4))
for i in range(len(num)):
plt.subplot(1, round(len(num)/1), i+1)
sns.kdeplot(x = df[num[i]])
plt.tight_layout()
df.Catalogs.value_counts()
12    282
6     252
18    233
24    233
Name: Catalogs, dtype: int64
From the visualization above we know that:
def show_val(ax):
for b in ax.patches:
x = b.get_x() + (b.get_width() / 2)
y = b.get_y() + b.get_height() + (b.get_height() * 0.01)
val = b.get_height()
ax.text(x, y, val, ha = 'center')
def show_val_percent(ax):
for b in ax.patches:
x = b.get_x() + (b.get_width() / 2)
y = b.get_y() + b.get_height() + (b.get_height() * 0.01)
val = f'{b.get_height():.2f}%'
ax.text(x, y, val, ha = 'center')
def show_val_int(ax):
for b in ax.patches:
x = b.get_x() + (b.get_width() / 2)
y = b.get_y() + b.get_height() + (b.get_height() * 0.01)
val = int(b.get_height())
ax.text(x, y, val, ha = 'center')
def group_df_uni(col):
ndf = df.groupby(col).size().reset_index().rename(columns = {0:'count'})
ndf['percentage'] = (ndf['count']/df.shape[0]) * 100
return ndf
def plot_bar_uni(col):
df1 = group_df_uni(col)
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (16,5))
sns.barplot(x = col, y= 'count', data = df1, ax = ax[0], ci = None)
ax[0].set_ylabel('Total Customer')
show_val_int(ax[0])
sns.barplot(x = col, y= 'percentage', data = df1, ax = ax[1], ci = None)
show_val_percent(ax[1])
plot_bar_uni('Age')
plot_bar_uni('Gender')
plot_bar_uni('OwnHome')
plot_bar_uni('Married')
plot_bar_uni('Location')
plot_bar_uni('History')
plot_bar_uni('AmountSpent')
From the visualization above we can get information that:
ProfileReport(df, title="Bee Department Store Customer Profile Report")
Because we are going to compare relationships between categorical and continuous variables, we should use the Phik correlation. Not only can it depict the relationship between categorical and continuous variables, it can also catch non-linear relationships between variables.
plt.figure(figsize = (14,8))
sns.heatmap(df.phik_matrix(), annot = True, fmt = '.2f')
plt.tight_layout()
interval columns not set, guessing: ['Salary', 'Children', 'Catalogs']
px.box(df, x = 'Age', y = 'Salary', template = 'ggplot2', color = 'Age')
As we can see from the heatmap, the customers' age has a strong correlation (0.66) with their salary.

From the boxplot above we can see that the relationship between age and salary is not linear but curved: salary is lowest for the young group, peaks in middle age, and drops slightly for the old group. The young group has a median salary of 21,400, the middle group 68,450, and the old group 54,600.
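The group medians quoted above come from a groupby aggregation. A sketch on toy data (the salaries are made up only to illustrate the computation):

```python
import pandas as pd

# Toy stand-in for the customer data; salaries are illustrative only.
df_demo = pd.DataFrame({
    'Age':    ['Young', 'Young', 'Middle', 'Middle', 'Old'],
    'Salary': [20000, 22800, 60000, 76900, 54600],
})

# Median salary per age group, the same aggregation behind the boxplot medians.
medians = df_demo.groupby('Age')['Salary'].median()
print(medians['Young'])   # 21400.0
print(medians['Middle'])  # 68450.0
```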
px.box(df, x = 'Married', y = 'Salary', template = 'ggplot2', color = 'Married')
From the heatmap, we can see that the customer's marital status has a very high correlation (0.85) with their salary.
Being married has a positive correlation with the customer's salary: the median salary of married customers is more than twice that of single customers, 76,700 versus 33,100.
px.box(df, x = 'AmountSpent', y = 'Salary', template = 'ggplot2', color = 'AmountSpent')
From the heatmap we can see that the customers' salary has a high correlation with how much they spend at Bee Department Store. The correlation is positive: the higher the salary, the higher the amount spent.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
Things need to be done in preprocessing from the EDA:
for col in cat:
print(col)
print(df[col].unique())
print('')
Age
['Young' 'Old' 'Middle']

Gender
['Male' 'Female']

OwnHome
['Rent' 'Own']

Married
['Single' 'Married']

Location
['Close' 'Far']

History
['Low' nan 'Medium' 'High']

AmountSpent
['Low' 'Medium' 'High' 'Very High']
mapping_age = {
'Young':0, 'Middle':1, 'Old':1
}
mapping_gender = {
'Male':0, 'Female':1
}
mapping_ownHome = {
'Rent':0, 'Own':1
}
mapping_married = {
'Single':0, 'Married':1
}
mapping_location = {
'Close':0, 'Far':1
}
mapping_history = {
'Low':0, 'Medium':1, 'High':2
}
mapping_amountSpent = {
'Low':0, 'Medium':1, 'High':2, 'Very High':3
}
maps = [mapping_age, mapping_gender, mapping_ownHome, mapping_married, mapping_location, mapping_history, mapping_amountSpent]
cat
['Age', 'Gender', 'OwnHome', 'Married', 'Location', 'History', 'AmountSpent']
def label_encoder(col, mapping):
df[col] = df[col].map(mapping)
for i in range(len(cat)):
df[cat[i]] = df[cat[i]].map(maps[i])
df.head(3)
|   | Age | Gender | OwnHome | Married | Location | Salary | Children | History | Catalogs | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 15000 | 3 | 0.0 | 6 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 13000 | 3 | 0.0 | 6 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 14600 | 3 | 0.0 | 6 | 0 |
Now that the features are all numerical, the next step is imputing the missing data. However, because we want to use IterativeImputer, we should first split the data into X and y so there is no data leakage: the features must not be imputed using the target as an input to the predictor.
We will use IterativeImputer because no information is given about why the data is missing. IterativeImputer estimates each missing value using the other features that are not missing, so the imputed values should be more accurate.
X = df.drop('AmountSpent', axis = 1)
y = df['AmountSpent']
IterImputer = IterativeImputer(random_state = 42)
IterImputer.fit(X)
X_transformed = IterImputer.transform(X)
X_trans = pd.DataFrame(X_transformed, columns = X.columns)
X_trans['History'].describe()
count    1000.000000
mean        0.975008
std         0.805053
min        -0.907033
25%         0.000000
50%         1.000000
75%         2.000000
max         2.653011
Name: History, dtype: float64
Because some imputed values are non-integer or fall outside the valid range, we round them and clip them back into the range [0, 2].
def put_in_range(history):
val = round(history)
if val < 0:
return 0
elif val > 2:
return 2
else:
return val
X_trans['History'] = X_trans['History'].apply(put_in_range)
X_trans['History'].value_counts()
1    361
0    331
2    308
Name: History, dtype: int64
Now all the values are within the valid range.
Because the dataset is small, we maximize the training set size so the model can be trained on more data.
X_train, X_test, y_train, y_test = train_test_split(X_trans, y, test_size=0.2, random_state=42, stratify = y)
To prevent data leakage, we split into train and test sets before normalizing with MinMaxScaler. The scaler is fitted only on the training data.
X_train_norm = X_train.copy()
X_test_norm = X_test.copy()
for i in X_train.columns:
# fit on training data column
scale = MinMaxScaler().fit(X_train_norm[[i]])
# transform the training data column
X_train_norm[i] = scale.transform(X_train_norm[[i]])
# transform the testing data column
X_test_norm[i] = scale.transform(X_test_norm[[i]])
X_train_norm.describe()
|   | Age | Gender | OwnHome | Married | Location | Salary | Children | History | Catalogs |
|---|---|---|---|---|---|---|---|---|---|
| count | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 |
| mean | 0.706250 | 0.497500 | 0.540000 | 0.508750 | 0.291250 | 0.290770 | 0.315833 | 0.490000 | 0.478750 |
| std | 0.455764 | 0.500307 | 0.498709 | 0.500236 | 0.454623 | 0.193989 | 0.350972 | 0.402464 | 0.367795 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.120353 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.275047 | 0.333333 | 0.500000 | 0.333333 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.421708 | 0.666667 | 1.000000 | 0.666667 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
Now all of the features share the same range of values (0 to 1). This helps some models (such as logistic regression) when predicting the target.
X_train= X_train_norm
X_test = X_test_norm
We can select the feature that we will use for the prediction using their correlation.
plt.figure(figsize = (14,8))
sns.heatmap(df.phik_matrix(), annot = True, fmt = '.2f')
plt.tight_layout()
interval columns not set, guessing: ['Age', 'Gender', 'OwnHome', 'Married', 'Location', 'Salary', 'Children', 'History', 'Catalogs', 'AmountSpent']
From the heatmap above we can see that all of the features have a decent correlation with the target, so we select all of them as features for our model. We can revisit this selection after training, using feature importance and SHAP values to see which features play the biggest role in predicting the target.
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
import shap
Because the classes in the dataset are balanced, we can use accuracy as a performance metric. In addition, we use the F1 score, the harmonic mean of precision and recall, to ensure the model performs well on every class. Since this is multi-class classification, the F1 score needs an averaging method; we use "macro" because the classes are balanced.
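Macro averaging is simply the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its support. A quick sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical predictions for a 4-class problem (like AmountSpent).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]

per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
macro = f1_score(y_true, y_pred, average='macro')   # unweighted mean of the above
print(np.isclose(per_class.mean(), macro))          # True
```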
For the baseline model we choose DummyClassifier. A baseline model is useful as a control group because it only predicts using a simple rule of thumb, showing how much better the trained models really are in comparison.
Baseline = DummyClassifier(random_state = 42)
Baseline.fit(X_train, y_train)
y_pred = Baseline.predict(X_test)
print(classification_report(y_test, y_pred, zero_division = 0))
precision recall f1-score support
0 0.00 0.00 0.00 50
1 0.25 1.00 0.40 50
2 0.00 0.00 0.00 50
3 0.00 0.00 0.00 50
accuracy 0.25 200
macro avg 0.06 0.25 0.10 200
weighted avg 0.06 0.25 0.10 200
I chose three models for this dataset: one simple model and two ensemble models, covering both ensemble types, bagging and boosting. Each has its own characteristics, which will reveal the right kind of model for this dataset.
models = [KNeighborsClassifier(),
RandomForestClassifier(random_state = 42),
XGBClassifier(random_state = 42, use_label_encoder=False, verbosity = 0)]
names = ['KNN Classifier',
'Random Forest Classifier',
'XGBoost Classifier']
def model_score(model, name):
print(f'{name}:')
print('Train score:')
model = model
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
print(f'Accuracy Score = {accuracy_score(y_train, y_pred_train)}')
print(f"F1 Score = {f1_score(y_train, y_pred_train, average = 'macro')}")
print('Test score:')
y_pred = model.predict(X_test)
print(f'Accuracy Score = {accuracy_score(y_test, y_pred)}')
print(f"F1 score = {f1_score(y_test, y_pred, average = 'macro')}")
print('\n')
for i in range(len(models)):
model_score(models[i], names[i])
KNN Classifier:
Train score:
Accuracy Score = 0.745
F1 Score = 0.743603496412225
Test score:
Accuracy Score = 0.62
F1 score = 0.6134544804431354

Random Forest Classifier:
Train score:
Accuracy Score = 0.99875
F1 Score = 0.998746835681577
Test score:
Accuracy Score = 0.695
F1 score = 0.6884943460567163

XGBoost Classifier:
Train score:
Accuracy Score = 0.99875
F1 Score = 0.998746835681577
Test score:
Accuracy Score = 0.68
F1 score = 0.6791112000800721
On the training data, KNeighborsClassifier is reasonably good, while RandomForestClassifier and XGBClassifier look superb, with all scores close to 1.0, meaning they predict the training target almost perfectly. On the test data, however, all three score below 0.7, which means the models are overfitted.
For hyper-parameter tuning we will use RandomizedSearchCV, which randomly picks combinations of hyper-parameters and evaluates each with cross-validation (CV). Cross-validation splits the data into n folds, uses one fold as test data and the rest for training, and repeats this until every fold has served as the test set. Unlike GridSearchCV, which tests every possible combination, RandomizedSearchCV samples only a fixed number of combinations and then returns the best one according to the scoring method we passed.
def plusNone(arr):
return np.array(arr.tolist() + [None])
params = [
{ # KNearestNeighbor
'leaf_size': list(range(1,50)),
'n_neighbors': list(range(1,30)),
'p': [1,2],
'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute']
},
{#Random Forest
'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
},
{ # XGBoost
'min_child_weight' : [int(x) for x in np.linspace(1, 20, num = 5)],
'gamma' : [float(x) for x in np.linspace(0, 1, num = 5)],
'tree_method' : ['auto', 'exact', 'approx', 'hist'],
'colsample_bytree' : [float(x) for x in np.linspace(0, 1, num = 5)],
'eta' : [float(x) for x in np.linspace(0, 1, num = 10)],
'lambda' : [float(x) for x in np.linspace(0, 1, num = 5)],
'alpha' : [float(x) for x in np.linspace(0, 1, num = 5)]
},
]
Here is the description of each hyperparameters that I use for each model:
KNeighborsClassifier
RandomForestClassifier
XGBClassifier
def model_score_tune_rscv(model, name, params, n_iter = 10, verbose = 0):
print(f'{name}:')
RSCV = RandomizedSearchCV(model, params, cv = 3, n_jobs = -1, verbose = verbose, scoring = 'f1_macro', random_state = 42, n_iter = n_iter)
RSCV.fit(X_train, y_train)
y_pred_train = RSCV.predict(X_train)
print('Train score:')
print(f'Accuracy Score = {accuracy_score(y_train, y_pred_train)}')
print(f"F1 Score = {f1_score(y_train, y_pred_train, average = 'macro')}")
print('Test score:')
y_pred = RSCV.predict(X_test)
print(f'Accuracy Score = {accuracy_score(y_test, y_pred)}')
print(f"F1 score = {f1_score(y_test, y_pred, average = 'macro')}")
print('\n')
print(RSCV.best_params_)
print('\n')
For RandomizedSearchCV there are several parameters that I used:
for i in range(3):
model_score_tune_rscv(models[i], names[i], params[i], 1000, verbose = 1)
KNN Classifier:
Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
Train score:
Accuracy Score = 0.6875
F1 Score = 0.6782571475371831
Test score:
Accuracy Score = 0.665
F1 score = 0.6620029693004044
{'p': 1, 'n_neighbors': 21, 'leaf_size': 32, 'algorithm': 'ball_tree'}
Random Forest Classifier:
Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
Train score:
Accuracy Score = 0.8575
F1 Score = 0.8557136728289723
Test score:
Accuracy Score = 0.72
F1 score = 0.714821183676838
{'n_estimators': 800, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': True}
XGBoost Classifier:
Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
Train score:
Accuracy Score = 0.865
F1 Score = 0.8622037296797982
Test score:
Accuracy Score = 0.715
F1 score = 0.712278931334429
{'tree_method': 'exact', 'min_child_weight': 1, 'lambda': 1.0, 'gamma': 1.0, 'eta': 0.1111111111111111, 'colsample_bytree': 1.0, 'alpha': 0.75}
After the hyper-parameter tuning above, the models' scores improve as they become less overfitted. The Random Forest Classifier gives the best results at predicting the target (AmountSpent): a test accuracy of 0.72 means 72% of its predictions are correct, and since the F1 score is the harmonic mean of precision and recall, its value of 0.714 shows the model predicts each class reasonably well.
The best model that we get is RandomForestClassifier with its hyper-parameters set to:
{'n_estimators': 800, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': True}
To improve the model further we could use GridSearchCV instead of RandomizedSearchCV: unlike RandomizedSearchCV, which tests only a fixed number of hyper-parameter combinations, GridSearchCV tests every combination we list. We could also add more hyper-parameters and more candidate values for each. The RandomizedSearchCV settings themselves could be tweaked as well, for example increasing the number of CV folds to better estimate performance on unseen data. Lastly, we could try more machine learning models, especially more advanced ones.
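As a rough illustration of the GridSearchCV idea, here is a sketch on synthetic data with a small grid inspired by the best Random Forest parameters found above; the grid values and the synthetic data are assumptions, not the notebook's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (4 classes, 9 features, like the encoded dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     n_informative=5, n_classes=4,
                                     random_state=42)

# Unlike RandomizedSearchCV, every one of the 2*2*2 = 8 combinations is tested.
grid = {
    'n_estimators': [50, 100],
    'min_samples_split': [5, 10],
    'max_depth': [40, 80],
}
GSCV = GridSearchCV(RandomForestClassifier(random_state=42), grid,
                    cv=3, scoring='f1_macro', n_jobs=-1)
GSCV.fit(X_demo, y_demo)
print(GSCV.best_params_)
```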
Aside from the model, we could also perform more feature engineering and pre-processing on the dataset, for example reshaping the continuous features toward a bell curve (normal distribution), which improves performance for some models. This can be done with feature transformations such as the log or Box-Cox transformation.
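A sketch of both transformations on hypothetical right-skewed salaries; the distribution parameters are made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed salary sample.
rng = np.random.default_rng(42)
salary = rng.lognormal(mean=11, sigma=0.5, size=1000)

log_salary = np.log(salary)                # simple log transform
boxcox_salary, lam = stats.boxcox(salary)  # Box-Cox fits lambda by max likelihood

# Both transforms pull the long right tail in, shrinking skewness toward 0.
print(stats.skew(salary) > abs(stats.skew(log_salary)))  # True
```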
param = {'n_estimators': 800, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': True, 'random_state': 42}
model = RandomForestClassifier(**param)
To validate the model, we use cross-validation to check that it performs well on unseen data. For this test we use the entire dataset.
X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])
print(f"Accuracy: {cross_val_score(model, X, y, scoring = 'accuracy').mean()}")
print(f"F1: {cross_val_score(model, X, y, scoring = 'f1_macro').mean()}")
Accuracy: 0.721
F1: 0.7137007674702391
As we can see, the cross-validation scores do not differ much from the earlier results, so the model is robust and generalizes well to unseen data.
def show_feature_importance(model):
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
ax = feat_importances.nlargest(25).plot(kind='barh', figsize=(10, 8))
ax.invert_yaxis()
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')
model.fit(X, y);
show_feature_importance(model)
As we can see above, although every feature contributes some importance to the prediction, some features matter more than others. Salary, History, and Catalogs are the main features used to predict the target, while OwnHome and Gender are considered much less important.
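Impurity-based `feature_importances_` can be biased toward high-cardinality features; permutation importance is a simple cross-check (shown here on synthetic stand-in data, in place of the SHAP analysis mentioned earlier):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the customer data (9 features, like the encoded X).
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=4, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Shuffle one feature at a time and measure the score drop: a large drop
# means the model genuinely relies on that feature.
result = permutation_importance(model, X_demo, y_demo,
                                n_repeats=5, random_state=42)
print(result.importances_mean.shape)  # (9,): one mean score drop per feature
```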
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from collections import Counter
df = pd.read_csv("CustomerMarketingDataset.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Age          1000 non-null   object
 1   Gender       1000 non-null   object
 2   OwnHome      1000 non-null   object
 3   Married      1000 non-null   object
 4   Location     1000 non-null   object
 5   Salary       1000 non-null   int64
 6   Children     1000 non-null   int64
 7   History      697 non-null    object
 8   Catalogs     1000 non-null   int64
 9   AmountSpent  1000 non-null   object
dtypes: int64(3), object(7)
memory usage: 78.2+ KB
cat = []
num = []
for col in df.columns:
if df[col].dtype == 'O':
cat.append(col)
else:
num.append(col)
plt.figure(figsize = (16,4))
for i in range(len(num)):
plt.subplot(1, round(len(num)/1), i+1)
sns.boxplot(y = df[num[i]], orient = 'v')
plt.tight_layout()
From the visualization above we can see that there are outliers in Salary, and clustering is very sensitive to outliers, so for the clustering model we need outlier handling. In principle outlier handling also applies to categorical data, but from the earlier EDA there are no outliers in the categorical columns. Hence, we only handle outliers in the continuous data, using the Z-score.
A clustering model is different from multi-class classification, so some of the required pre-processing differs as well.
We already have the EDA from before, so we know what needs to be done for the clustering model. The pre-processing steps that need to be done are:
mapping_age = {
'Young':0, 'Middle':1, 'Old':1
}
mapping_gender = {
'Male':0, 'Female':1
}
mapping_ownHome = {
'Rent':0, 'Own':1
}
mapping_married = {
'Single':0, 'Married':1
}
mapping_location = {
'Close':0, 'Far':1
}
mapping_history = {
'Low':0, 'Medium':1, 'High':2
}
mapping_amountSpent = {
'Low':0, 'Medium':1, 'High':2, 'Very High':3
}
maps = [mapping_age, mapping_gender, mapping_ownHome, mapping_married, mapping_location, mapping_history, mapping_amountSpent]
def label_encoder(col, mapping):
df[col] = df[col].map(mapping)
for i in range(len(cat)):
df[cat[i]] = df[cat[i]].map(maps[i])
We will use IterativeImputer because no information is given about why the data is missing. IterativeImputer estimates each missing value using the other features that are not missing, so the imputed values should be more accurate.

Unlike in the multi-class classification before, there is no target column in the dataset yet. Hence, we can use all of the columns for the IterativeImputer, making the imputed values more accurate.
IterImputer = IterativeImputer(random_state = 42)
IterImputer.fit(df)
df_transformed = IterImputer.transform(df)
df_trans = pd.DataFrame(df_transformed, columns = df.columns)
df_trans['History'].describe()
count    1000.000000
mean        1.019872
std         0.800030
min        -0.448088
25%         0.000000
50%         1.000000
75%         2.000000
max         2.490756
Name: History, dtype: float64
Because some imputed values are non-integer or fall outside the valid range, we round them and clip them back into the range [0, 2].
def put_in_range(history):
val = round(history)
if val < 0:
return 0
elif val > 2:
return 2
else:
return val
df_trans['History'] = df_trans['History'].apply(put_in_range)
df_trans['History'].value_counts()
1    347
2    336
0    317
Name: History, dtype: int64
Now all the values are within the valid range.
df = df_trans
df['Children'].value_counts()
0.0    462
1.0    267
2.0    146
3.0    125
Name: Children, dtype: int64
df['Catalogs'].value_counts()
12.0    282
6.0     252
18.0    233
24.0    233
Name: Catalogs, dtype: int64
outlier = num.copy()
print(f'Number of rows before filtering outliers: {len(df)}')
filtered_entries = np.array([True] * len(df))
for col in outlier:
zscore = abs(stats.zscore(df[col]))
filtered_entries = (zscore < 3) & filtered_entries # keep the data that have z-score less than 3
df_filtered = df[filtered_entries]
print(f'Number of rows after filtering outliers: {len(df_filtered)}')
Number of rows before filtering outliers: 1000
Number of rows after filtering outliers: 999
df_filtered
|   | Age | Gender | OwnHome | Married | Location | Salary | Children | History | Catalogs | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15000.0 | 3.0 | 0 | 6.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13000.0 | 3.0 | 0 | 6.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 14600.0 | 3.0 | 0 | 6.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 17900.0 | 3.0 | 0 | 6.0 | 0.0 |
| 4 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 12700.0 | 2.0 | 0 | 6.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 99200.0 | 0.0 | 2 | 24.0 | 3.0 |
| 996 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 110000.0 | 0.0 | 2 | 24.0 | 3.0 |
| 997 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 120800.0 | 1.0 | 2 | 24.0 | 3.0 |
| 998 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 123000.0 | 1.0 | 2 | 24.0 | 3.0 |
| 999 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 112900.0 | 0.0 | 2 | 24.0 | 3.0 |
999 rows × 10 columns
The features used for the clustering model are the ones that can be categorized under RFM, so that they are relevant for customer segmentation. Those features are:
df.columns
Index(['Age', 'Gender', 'OwnHome', 'Married', 'Location', 'Salary', 'Children',
'History', 'Catalogs', 'AmountSpent'],
dtype='object')
cols_selection = ['Salary', 'AmountSpent', 'History']
df_selection = df_filtered[cols_selection]
Standardization with StandardScaler scales the data so all features share the same scale. Compared to normalization, standardization is more robust to outliers: normalization squeezes values into the 0-to-1 range, while standardization gives the data a mean of 0 and a standard deviation of 1, so an outlier matters much less. Because clustering is sensitive to outliers, standardization is the more appropriate feature scaling here.
df_cluster_std = StandardScaler().fit_transform(df_selection)
df_cluster = pd.DataFrame(df_cluster_std, columns = df_selection.columns)
df_cluster
|   | Salary | AmountSpent | History |
|---|---|---|---|
| 0 | -1.348046 | -1.339763 | -1.260441 |
| 1 | -1.413819 | -1.339763 | -1.260441 |
| 2 | -1.361201 | -1.339763 | -1.260441 |
| 3 | -1.252676 | -1.339763 | -1.260441 |
| 4 | -1.423685 | -1.339763 | -1.260441 |
| ... | ... | ... | ... |
| 994 | 1.420982 | 1.342448 | 1.215823 |
| 995 | 1.776154 | 1.342448 | 1.215823 |
| 996 | 2.131327 | 1.342448 | 1.215823 |
| 997 | 2.203677 | 1.342448 | 1.215823 |
| 998 | 1.871525 | 1.342448 | 1.215823 |
999 rows × 3 columns
For clustering, we will use two algorithms: one of the simplest, KMeans, and a more advanced one, DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
KMeans works by finding cluster centers that minimize the distance of each data point to its center. The algorithm starts by placing n cluster centers at random, assigns every data point to its nearest center, then moves each center and updates the assignments accordingly. This process repeats until it converges (no data point changes cluster anymore).
Unlike KMeans, DBSCAN has a concept of noise: a data point may end up belonging to no cluster at all. The algorithm estimates local density from each point's neighbors; regions where the density is high are treated as clusters, and as the density threshold decreases, neighboring high-density regions merge and are considered one cluster. Points left in low-density regions are labeled as noise.
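The noise concept can be seen in a tiny example with scikit-learn's `DBSCAN` on made-up points (the `eps` and `min_samples` values here are illustrative, not the ones tuned below): two dense groups become clusters 0 and 1, and the isolated point gets the special label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups plus one far-away point that should become noise
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # the isolated point is labeled -1 (noise)
```
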
For scoring the clustering we will use inertia and the silhouette score. Inertia will be used in the elbow method to help find the best number of clusters, while the silhouette score will be used directly to measure how good the clustering is.
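Inertia is simply the within-cluster sum of squared distances from each point to its assigned centroid. A quick sanity check on synthetic data (illustrative only, not our dataset) confirms this matches scikit-learn's `inertia_` attribute:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# Recompute inertia by hand: sum of squared distances of each point
# to its own cluster center.
manual = sum(((X[km.labels_ == j] - c) ** 2).sum()
             for j, c in enumerate(km.cluster_centers_))
assert np.isclose(manual, km.inertia_)
```
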
Because KMeans is an unsupervised clustering algorithm, we cannot use GridSearchCV to search for the best combination of hyper-parameters. However, we can use the elbow method and the silhouette score to find the best n_clusters for it.
inertia = []
silhouette = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(df_cluster.values)
    inertia.append(kmeans.inertia_)
    silhouette.append(silhouette_score(df_cluster, kmeans.labels_))
In KMeans, the hyper-parameters that we set are n_clusters and random_state.
print(inertia)
plt.figure(figsize=(20, 10))
sns.lineplot(x=range(2, 11), y=inertia, color='purple', linewidth = 4)
sns.scatterplot(x=range(2, 11), y=inertia, s=300, color='red', linestyle='--')
plt.title('Elbow Method');
[1131.355018029066, 638.2966804799604, 521.316288184918, 417.6794370158229, 348.18474169485893, 292.2931293977714, 245.0235437567683, 205.12488160831802, 184.15133004793665]
for i in range(len(inertia)-1):
    percentage = ((inertia[i] - inertia[i+1]) / inertia[i]) * 100
    print(f"Inertia decrease = {percentage}% Number of clusters = {i+3}")
Inertia decrease = 43.58122160522722% Number of clusters = 3 Inertia decrease = 18.326962347208237% Number of clusters = 4 Inertia decrease = 19.879841378049118% Number of clusters = 5 Inertia decrease = 16.63828504881156% Number of clusters = 6 Inertia decrease = 16.052286503143108% Number of clusters = 7 Inertia decrease = 16.171979730893906% Number of clusters = 8 Inertia decrease = 16.283603418966617% Number of clusters = 9 Inertia decrease = 10.22477204907299% Number of clusters = 10
From the elbow method above we can see that the biggest drop in inertia happens when the number of clusters changes to 3; after that the changes become insignificant. Therefore, 3 is the best number of clusters.
print(silhouette)
plt.figure(figsize=(20, 10))
sns.lineplot(x=range(2, 11), y=silhouette, color='purple', linewidth = 4)
sns.scatterplot(x=range(2, 11), y=silhouette, s=300, color='red', linestyle='--')
plt.title('Silhouette Score');
[0.5026717517911004, 0.5192093002462845, 0.47535575107344324, 0.4836609814259442, 0.4951165523397102, 0.5098569196218221, 0.5298352653450185, 0.5543625315779311, 0.5573519747534983]
The silhouette score also shows an increase when the number of clusters is 3, meaning the objects are better matched to their own clusters, while at 4 clusters the silhouette value drops. Combined with the insight from the elbow method, the best number of clusters is 3.
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(df_cluster.values)
KMeans(n_clusters=3, random_state=42)
dbscan_params = []
cluster_counts = []
silhouette_scores = []
Because clustering is unsupervised, we cannot directly measure how good a model is, so GridSearchCV cannot be used; however, we can write our own function for hyper-parameter tuning. For DBSCAN, the hyper-parameters that we will tune are eps (the neighborhood radius) and min_samples (the minimum number of points needed to form a dense region).
Furthermore, we can use the silhouette score to indirectly find the best combination of hyper-parameters, because the silhouette value gauges an object's cohesion with its own cluster in comparison to other clusters (separation). A high silhouette value implies that the object is well matched to its own cluster and poorly matched to neighboring clusters. The silhouette ranges from -1 to +1.
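Concretely, for a point i the silhouette is s(i) = (b - a) / max(a, b), where a is the mean distance from i to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. A tiny hand-computed check against scikit-learn, on made-up points:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, pairwise_distances

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
D = pairwise_distances(X)

def silhouette_point(i):
    same = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = D[i, same].mean()                # mean intra-cluster distance
    b = min(D[i, labels == c].mean()     # mean distance to nearest other cluster
            for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_point(i) for i in range(len(X))])
assert np.allclose(manual, silhouette_samples(X, labels))
```
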
def grid_search_dbscan(data, eps_vals, min_sample_vals, min_cluster, max_cluster):
    for eps_val in eps_vals:
        for min_sample_val in min_sample_vals:
            dbscan = DBSCAN(eps=eps_val, min_samples=min_sample_val)
            clusters = dbscan.fit_predict(data)
            cluster_val_counts = Counter(clusters)
            n_cluster = len(np.unique(clusters))  # distinct labels, including -1 (noise) when present
            if min_cluster <= n_cluster <= max_cluster:
                dbscan_params.append({'eps_val': eps_val,
                                      'min_sample_val': min_sample_val,
                                      'n_cluster': n_cluster})
                cluster_counts.append(cluster_val_counts)
                silhouette_scores.append(silhouette_score(data, clusters))
grid_search_dbscan(df_cluster, np.arange(0.1, 5, 0.1), np.arange(1, 50, 1), 3, 10)
index = np.argmax(silhouette_scores)
print('Best hyper-parameters: ',dbscan_params[index])
print('Best value counts: ', cluster_counts[index])
Best hyper-parameters: {'eps_val': 0.9, 'min_sample_val': 49, 'n_cluster': 4}
Best value counts: Counter({1: 342, 2: 335, 0: 303, -1: 19})
As with KMeans, DBSCAN also finds 3 clusters, plus one additional cluster for the noise and outliers. This supports 3 as the best number of clusters for our customer segmentation.
print('Kmeans silhouette score when number of clusters = 3:', silhouette[1])
print('DBSCAN silhouette score when number of clusters = 3 + 1 noise cluster:', silhouette_scores[index])
Kmeans silhouette score when number of clusters = 3: 0.5192093002462845 DBSCAN silhouette score when number of clusters = 3 + 1 noise cluster: 0.5183572748506445
Both models' silhouette scores are around 0.52, which means the data points are placed adequately well: they are fairly well matched to their own cluster and poorly matched to the other clusters. The results could still be improved with a more complex model, more in-depth pre-processing, or a wider hyper-parameter search.
From the result above we can see that KMeans clusters the data slightly better, so we use it for the final clustering.
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(df_cluster.values);
df_cluster['label'] = kmeans.labels_
df_cluster
| Salary | AmountSpent | History | label | |
|---|---|---|---|---|
| 0 | -1.348046 | -1.339763 | -1.260441 | 0 |
| 1 | -1.413819 | -1.339763 | -1.260441 | 0 |
| 2 | -1.361201 | -1.339763 | -1.260441 | 0 |
| 3 | -1.252676 | -1.339763 | -1.260441 | 0 |
| 4 | -1.423685 | -1.339763 | -1.260441 | 0 |
| ... | ... | ... | ... | ... |
| 994 | 1.420982 | 1.342448 | 1.215823 | 1 |
| 995 | 1.776154 | 1.342448 | 1.215823 | 1 |
| 996 | 2.131327 | 1.342448 | 1.215823 | 1 |
| 997 | 2.203677 | 1.342448 | 1.215823 | 1 |
| 998 | 1.871525 | 1.342448 | 1.215823 | 1 |
999 rows × 4 columns
X = df_cluster.drop('label', axis = 1)
y = df_cluster['label']
pca = PCA(n_components = 2)
pca.fit(X)
pcs = pca.transform(X)
df_reduced = pd.DataFrame(pcs, columns = ['PC1', 'PC2'])
print('Explained variance:', pca.explained_variance_)
print('Explained variance ratio:', pca.explained_variance_ratio_)
Explained variance: [2.55845471 0.32020973] Explained variance ratio: [0.85196456 0.10662973]
pca.explained_variance_ratio_.sum() * 100
95.85942963935842
From the PCA above we can see that about 95% of the variance in the data is explained by the first two principal components. Since this is over 80%, PC1 and PC2 alone can represent our features well enough to visualize the clusters in a 2D plot.
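As a sanity check on what `explained_variance_ratio_` reports: each ratio is an eigenvalue of the sample covariance matrix divided by the total variance. A small illustration on synthetic data (not our dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# synthetic 3-feature data with one dominant direction of variance
X = rng.normal(size=(200, 3)) @ np.array([[3., 0., 0.],
                                          [1., 1., 0.],
                                          [0., 0., .2]])
pca = PCA(n_components=2).fit(X)

# eigenvalues of the sample covariance matrix, sorted descending
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
assert np.allclose(pca.explained_variance_, eigvals[:2])
assert np.allclose(pca.explained_variance_ratio_, eigvals[:2] / eigvals.sum())
```
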
df_reduced['Cluster'] = y
fig, ax = plt.subplots(figsize=(15,10))
sns.scatterplot(
x="PC1", y="PC2",
hue="Cluster",
data=df_reduced,
palette=['blue','red','green'],
s=160,
ax=ax
);
df = pd.read_csv("CustomerMarketingDataset.csv")
# 'cat', 'maps', and 'outlier' were defined in the encoding and outlier steps earlier in the notebook
for i in range(len(cat)):
    df[cat[i]] = df[cat[i]].map(maps[i])
print(f'Number of rows before filtering outliers: {len(df)}')
filtered_entries = np.array([True] * len(df))
for col in outlier:
    zscore = abs(stats.zscore(df[col]))
    filtered_entries = (zscore < 3) & filtered_entries  # keep rows with z-score below 3
df_filtered = df[filtered_entries]
print(f'Number of rows after filtering outliers: {len(df_filtered)}')
Number of rows before filtering outliers: 1000 Number of rows after filtering outliers: 999
df_insight = df_filtered[cols_selection].copy().reset_index(drop = True)
df_insight['Cluster'] = df_cluster['label']
df_insight
| Salary | AmountSpent | History | Cluster | |
|---|---|---|---|---|
| 0 | 15000 | 0 | 0.0 | 0 |
| 1 | 13000 | 0 | 0.0 | 0 |
| 2 | 14600 | 0 | 0.0 | 0 |
| 3 | 17900 | 0 | 0.0 | 0 |
| 4 | 12700 | 0 | 0.0 | 0 |
| ... | ... | ... | ... | ... |
| 994 | 99200 | 3 | 2.0 | 1 |
| 995 | 110000 | 3 | 2.0 | 1 |
| 996 | 120800 | 3 | 2.0 | 1 |
| 997 | 123000 | 3 | 2.0 | 1 |
| 998 | 112900 | 3 | 2.0 | 1 |
999 rows × 4 columns
df_insight['Cluster'].unique()
array([0, 2, 1])
df_insight[df_insight['Cluster'] == 0].describe()
| Salary | AmountSpent | History | Cluster | |
|---|---|---|---|---|
| count | 303.000000 | 303.000000 | 216.000000 | 303.0 |
| mean | 25661.386139 | 0.201320 | 0.009259 | 0.0 |
| std | 15190.141258 | 0.409811 | 0.096001 | 0.0 |
| min | 10100.000000 | 0.000000 | 0.000000 | 0.0 |
| 25% | 14800.000000 | 0.000000 | 0.000000 | 0.0 |
| 50% | 20800.000000 | 0.000000 | 0.000000 | 0.0 |
| 75% | 29600.000000 | 0.000000 | 0.000000 | 0.0 |
| max | 88600.000000 | 2.000000 | 1.000000 | 0.0 |
From the cluster statistics we can see that customers in cluster 0 have a low salary, a low amount spent, and a low purchase history at Bee Department Store. These customers can be categorized as low spenders.
df_insight[df_insight['Cluster'] == 1].describe()
| Salary | AmountSpent | History | Cluster | |
|---|---|---|---|---|
| count | 345.000000 | 345.000000 | 254.000000 | 345.0 |
| mean | 85258.840580 | 2.701449 | 1.960630 | 1.0 |
| std | 20900.276476 | 0.488975 | 0.194858 | 0.0 |
| min | 41300.000000 | 1.000000 | 1.000000 | 1.0 |
| 25% | 69100.000000 | 2.000000 | 2.000000 | 1.0 |
| 50% | 83300.000000 | 3.000000 | 2.000000 | 1.0 |
| 75% | 99300.000000 | 3.000000 | 2.000000 | 1.0 |
| max | 140700.000000 | 3.000000 | 2.000000 | 1.0 |
From the cluster statistics we can see that customers in cluster 1 have the highest salary, amount spent, and purchase history at Bee Department Store. These customers can be categorized as high spenders.
df_insight[df_insight['Cluster'] == 2].describe()
| Salary | AmountSpent | History | Cluster | |
|---|---|---|---|---|
| count | 351.000000 | 351.000000 | 226.000000 | 351.0 |
| mean | 53405.698006 | 1.435897 | 0.973451 | 2.0 |
| std | 19060.616734 | 0.551123 | 0.338892 | 0.0 |
| min | 13200.000000 | 0.000000 | 0.000000 | 2.0 |
| 25% | 41100.000000 | 1.000000 | 1.000000 | 2.0 |
| 50% | 49900.000000 | 1.000000 | 1.000000 | 2.0 |
| 75% | 63300.000000 | 2.000000 | 1.000000 | 2.0 |
| max | 118000.000000 | 3.000000 | 2.000000 | 2.0 |
From the cluster statistics we can see that customers in cluster 2 have a medium salary, a medium amount spent, and a medium purchase history at Bee Department Store. These customers can be categorized as medium spenders.